1 Opening example data and web data
1-1 Opening example data: sysuse
    . sysuse auto, clear
    (1978 Automobile Data)
1-2 Getting data from the web: webuse or use
    . use nlswork, clear
    file nlswork.dta not found
    r(601);

    * Fetch the dataset from the web by giving its full URL
    . use http://www.stata-press.com/data/r9/nlswork
    (National Longitudinal Survey. Young Women 14-26 years of age in 1968)

    * Equivalent to the previous command: fetch from Stata's official example-data site
    . webuse nlswork, clear
    (National Longitudinal Survey. Young Women 14-26 years of age in 1968)
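- By default, webuse fetches data only from http://www.stata-press.com/data (the release-specific subdirectory, such as r14, is added automatically). For a dataset hosted anywhere else, webuse fails unless its prefix is changed with webuse set; otherwise give the full URL to use. Either way, a working network connection is required.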
- Another large source of web data is the Boston College data archive, which hosts all of the datasets used in Wooldridge's Introductory Econometrics. For example:
    . use http://fmwww.bc.edu/ec-p/data/wooldridge/CEOSAL1
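As a small sketch of how the webuse location can be redirected instead of typing a full URL each time: webuse query reports the current prefix and webuse set changes it. The example assumes the Boston College directory shown above is reachable and that the file name matches the URL given there; it is an illustration, not part of the original notes.

```stata
* Show the URL prefix that webuse currently uses
webuse query

* Point webuse at another site (assumption: the directory from the example above)
webuse set http://fmwww.bc.edu/ec-p/data/wooldridge
webuse CEOSAL1, clear

* Restore the default Stata Press location
webuse set
```

Resetting the prefix at the end keeps later webuse calls pointing at the official example data.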
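The use command opens only files in Stata's own .dta format. Data in any other format cannot be opened this way and must be imported; the simplest and most direct route is to copy the data and paste it into Stata's Data Editor.
That route assumes the originating software is available. If, for example, the data are in SAS or SPSS format but neither SAS nor SPSS is installed, use one of Stata's import commands or a format-conversion tool such as Stat/Transfer.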
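A sketch of reading non-.dta files with Stata's own import commands. The file names are hypothetical, and import spss and import sas assume Stata 16 or later; on earlier releases a conversion tool is still needed for those two formats.

```stata
* Excel workbook whose first row holds variable names (import excel: Stata 12 or later)
import excel using "mydata.xlsx", firstrow clear

* SPSS and SAS files (assumption: Stata 16 or later; file names are placeholders)
import spss using "mydata.sav", clear
import sas using "mydata.sas7bdat", clear
```

Each command replaces the data in memory, which is why clear is specified every time.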
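2 Data types
Stata variables generally fall into three classes: numeric, string, and date.
2-1 Numeric variables
    . dis 5
    5
    . dis -5
    -5
    . dis 5.2
    5.2
    . dis 5.2e+3
    5200
    . dis 5.2e-2
    .052
Numeric variables come in five storage types that differ in range and precision: byte, int, long, float, and double.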
Storage type | Minimum | Maximum | Closest to 0 without being 0 | Bytes |
byte | -127 | 100 | +/-1 | 1 |
int | -32,767 | 32,740 | +/-1 | 2 |
long | -2,147,483,647 | 2,147,483,620 | +/-1 | 4 |
float | -1.70141173319*10^38 | 1.70141173319*10^38 | +/-10^-38 | 4 |
double | -8.9884656743*10^307 | 8.9884656743*10^307 | +/-10^-323 | 8 |
    * Create a dataset with one observation
    . set obs 1
    number of observations (_N) was 0, now 1
    * The message says the dataset had no observations before and now has one

    . gen a=1

    * d abbreviates describe, which shows the properties of the dataset.
    * Look at the storage type of a. Stata's factory default for new variables
    * is float; the double shown here suggests the default had been changed
    * in this session (set type double).
    . d

    Contains data
      obs:             1
     vars:             1
     size:             8
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    a               double  %10.0g
    ---------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.

    * Compress without losing information, so the data take as little space as possible
    . compress
      variable a was double now byte
      (7 bytes saved)

    . d

    Contains data
      obs:             1
     vars:             1
     size:             1
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    a               byte    %10.0g
    ---------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.

    * 32741 exceeds the int maximum of 32,740, so Stata promotes a to long
    . replace a=32741
    variable a was byte now long
    (1 real change made)

    . gen double b=1

    . recast double a

    . d

    Contains data
      obs:             1
     vars:             2
     size:            16
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    a               double  %10.0g
    b               double  %10.0g
    ---------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.

    * Both a and b are now double
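The practical consequence of choosing float or double shows up in comparisons. The following sketch (variable names x and y are illustrative) stores 0.1 at both precisions and compares each with the literal 0.1, which Stata evaluates in double precision.

```stata
clear
set obs 1
gen float  x = 0.1
gen double y = 0.1

* 0.1 has no exact binary representation; a float keeps fewer digits of it
display x[1] == 0.1          // 0: the float value differs from the double literal
display x[1] == float(0.1)   // 1: compare at float precision instead
display y[1] == 0.1          // 1
```

This is one reason to wrap constants in float() when testing float variables for equality, or to store such variables as double.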
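2-2 String variables
A fixed-length string variable is declared as str#, where # is its maximum length; str20, for example, holds up to 20 characters. The limit for str# is 2,045 in Stata 13 and later (244 in earlier releases). The length is counted in bytes, so a three-character Chinese name takes 6 bytes in a 2-byte encoding such as GBK and 9 bytes in UTF-8, which Stata 14 and later use.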
    . gen c="123.5"

    . d c

                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    c               str5    %9s

    . destring c
    must specify either generate or replace option
    r(198);

    . destring c,replace
    c has all characters numeric; replaced as double

    . d c

                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    c               double  %10.0g
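String lengths in str# are counted in bytes, not visible characters, which matters for Chinese text. A minimal sketch, assuming Stata 14 or later (where strings are stored as UTF-8); the name used is only an example.

```stata
* strlen() counts bytes; ustrlen() counts Unicode characters (Stata 14 or later)
display strlen("张小明")     // 9: three characters, three bytes each in UTF-8
display ustrlen("张小明")    // 3
display strlen("John")       // 4: one byte per ASCII character
```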
2-3 Date variables
Stata treats 1 January 1960 as day 0, so 31 December 1959 is day -1 and 25 January 2001 is day 15,000.
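The day-count encoding can be checked directly with display. This short sketch uses only built-in date functions and reproduces the three dates mentioned above.

```stata
display mdy(1, 1, 1960)      // 0
display mdy(12, 31, 1959)    // -1
display td(25jan2001)        // 15000
display %td 15000            // 25jan2001
```

The %td format is what turns the stored integer back into a readable date.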
Types of dates and their human readable forms (HRFs)

    Date type          Examples of HRFs
    --------------------------------------------
    datetime           20jan2010 09:15:22.120
    date               20jan2010, 20/01/2010, ...
    weekly date        2010w3
    monthly date       2010m1
    quarterly date     2010q1
    half-yearly date   2010h1
    yearly date        2010
    --------------------------------------------
    . display %d date("20060125", "YMD")
    25jan2006
    . display %td date("060125", "20YMD")
    25jan2006
    . display %td date("060125", "19YMD")
    25jan1906
    . display %tc clock("20231101113159", "YMDhms")
    01nov2023 11:31:59

    * Compute an age in years (dividing by an average year of 365.25 days)
    . di (mdy(9,18,2023)-mdy(2,27,2002))/365.25
    21.555099
* Other date functions
    . sysuse sp500, clear
    (S&P 500)

    . des

    Contains data from C:\Program Files (x86)\Stata14\ado\base/s/sp500.dta
      obs:           248                          S&P 500
     vars:             7                          22 Apr 2014 10:52
     size:         7,440                          (_dta has notes)
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    date            int     %td                   Date
    open            float   %9.0g                 Opening price
    high            float   %9.0g                 High price
    low             float   %9.0g                 Low price
    close           float   %9.0g                 Closing price
    volume          double  %12.0gc               Volume (thousands)
    change          float   %9.0g                 Closing price change
    ---------------------------------------------------------------------------
    Sorted by: date

    . gen d=day(date)
    . gen w=week(date)
    . gen m=month(date)
    . gen q= quarter(date)
    . gen hy= halfyear(date)
    . gen y=year(date)
    . gen ndate1=mdy(m,d,y)
    . gen weekd=dow(date)
    . gen yeard=doy(date)

    . des

    Contains data from C:\Program Files (x86)\Stata14\ado\base/s/sp500.dta
      obs:           248                          S&P 500
     vars:            16                          22 Apr 2014 10:52
     size:        25,296                          (_dta has notes)
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    date            int     %td                   Date
    open            float   %9.0g                 Opening price
    high            float   %9.0g                 High price
    low             float   %9.0g                 Low price
    close           float   %9.0g                 Closing price
    volume          double  %12.0gc               Volume (thousands)
    change          float   %9.0g                 Closing price change
    d               double  %10.0g
    w               double  %10.0g
    m               double  %10.0g
    q               double  %10.0g
    hy              double  %10.0g
    y               double  %10.0g
    ndate1          double  %10.0g
    weekd           double  %10.0g
    yeard           double  %10.0g
    ---------------------------------------------------------------------------
    Sorted by: date
         Note: Dataset has changed since last saved.
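The extraction functions above return plain numbers. When the goal is a monthly or quarterly date variable, a conversion function paired with the matching display format may be more convenient. A sketch using the same sp500 data; the variable names ym and yq are illustrative.

```stata
sysuse sp500, clear

* Convert the daily date to a monthly and a quarterly date
gen ym = mofd(date)
format ym %tm
gen yq = qofd(date)
format yq %tq

list date ym yq in 1/3
```

Without the format commands, ym and yq would display as counts of months and quarters since 1960.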
2-4 Missing values
    . display 2/0
    .
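Stata stores a numeric missing value as ".", and treats it as larger than any number. The sketch below illustrates why conditions with > usually need an explicit missing() check; it uses the auto example data, in which rep78 has some missing values.

```stata
display missing(2/0)      // 1: the result of 2/0 is missing
display . > 1e10          // 1: missing sorts above every number

sysuse auto, clear
* This count also includes the observations where rep78 is missing
count if rep78 > 4
* Exclude them explicitly
count if rep78 > 4 & !missing(rep78)
```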
3 Converting between data types
    * Convert string data to numeric: strip the spaces between characters
    . webuse destring2, clear

    . des

    Contains data from http://www.stata-press.com/data/r14/destring2.dta
      obs:            10
     vars:             3                          3 Mar 2014 22:50
     size:           280
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    date            str14   %10s
    price           str11   %11s
    percent         str3    %9s
    ---------------------------------------------------------------------------
    Sorted by:

    . destring date, replace
    date contains nonnumeric characters; no replace

    . destring date, replace ignore(" ")
    date: characters space removed; replaced as long

    . des date

                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    date            long    %10.0g

    . list date

         +----------+
         |     date |
         |----------|
      1. | 19991210 |
      2. | 20000708 |
      3. | 19970302 |
      4. | 19990900 |
      5. | 19981004 |
         |----------|
      6. | 20000328 |
      7. | 20000808 |
      8. | 19971020 |
      9. | 19980116 |
     10. | 19991112 |
         +----------+

    . destring price percent, gen(price2 percent2) ignore("$ ,%")
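For percentage strings, destring also has a percent option, which removes the % sign and divides by 100 so that the result is a fraction. A brief sketch on the same dataset; the new variable name percent_frac is illustrative.

```stata
webuse destring2, clear

* percent strips the % sign and converts the values to fractional form
destring percent, generate(percent_frac) percent

list percent percent_frac in 1/3
```

Using generate() rather than replace keeps the original string variable for comparison.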
3-2 Converting numeric to string: tostring
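A minimal sketch of tostring, assuming the auto example data; the new variable names are illustrative. Like destring, tostring requires either generate() or replace.

```stata
sysuse auto, clear

* Keep the numeric original and create a string copy
tostring rep78, generate(rep78_str)
describe rep78 rep78_str

* Convert the string copy back to numeric to confirm the round trip
destring rep78_str, generate(rep78_num)
```

rep78 holds integers, so the conversion loses no information; for non-integer variables tostring may need a format() or force option.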
#### 3-4 Display formats: format
```stata
. webuse census10,clear
(1980 Census data by state)

. des

Contains data from http://www.stata-press.com/data/r14/census10.dta
  obs:            50                          1980 Census data by state
 vars:             4                          9 Apr 2014 08:05
 size:         1,200
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %14s                  State
region          int     %8.0g      cenreg     Census region
pop             long    %11.0g                Population
medage          float   %9.0g                 Median age
---------------------------------------------------------------------------
Sorted by:

. list in 1/5

     +-----------------------------------------+
     |      state   region        pop   medage |
     |-----------------------------------------|
  1. |    Alabama    South    3893888     29.3 |
  2. |     Alaska     West     401851     26.1 |
  3. |    Arizona     West    2718215     29.2 |
  4. |   Arkansas    South    2286435     30.6 |
  5. | California     West   23667902     29.9 |
     +-----------------------------------------+

. format state %-14s

. list in 1/5

     +-----------------------------------------+
     | state        region        pop   medage |
     |-----------------------------------------|
  1. | Alabama       South    3893888     29.3 |
  2. | Alaska         West     401851     26.1 |
  3. | Arizona        West    2718215     29.2 |
  4. | Arkansas      South    2286435     30.6 |
  5. | California     West   23667902     29.9 |
     +-----------------------------------------+

. format pop %14.0gc

. list in 1/5

     +-------------------------------------------+
     | state        region          pop   medage |
     |-------------------------------------------|
  1. | Alabama       South    3,893,888     29.3 |
  2. | Alaska         West      401,851     26.1 |
  3. | Arizona        West    2,718,215     29.2 |
  4. | Arkansas      South    2,286,435     30.6 |
  5. | California     West   23,667,902     29.9 |
     +-------------------------------------------+

. format medage %8.2f

. list in 1/5

     +-------------------------------------------+
     | state        region          pop   medage |
     |-------------------------------------------|
  1. | Alabama       South    3,893,888    29.30 |
  2. | Alaska         West      401,851    26.10 |
  3. | Arizona        West    2,718,215    29.20 |
  4. | Arkansas      South    2,286,435    30.60 |
  5. | California     West   23,667,902    29.90 |
     +-------------------------------------------+

. gen id=_n

. list in 1/6

     +------------------------------------------------+
     | state        region          pop   medage   id |
     |------------------------------------------------|
  1. | Alabama       South    3,893,888    29.30    1 |
  2. | Alaska         West      401,851    26.10    2 |
  3. | Arizona        West    2,718,215    29.20    3 |
  4. | Arkansas      South    2,286,435    30.60    4 |
  5. | California     West   23,667,902    29.90    5 |
     |------------------------------------------------|
  6. | Colorado       West    2,889,964    28.60    6 |
     +------------------------------------------------+
```
3-5 Entering data directly in Stata: input
    . input id str10 name economy

                id        name    economy
      1. 1 John 40
      2. 2 Chris 80
      3. 3 Jack 90
      4. 4 Huang 70
      5. 5 Tom 53
      6. end

    . save economy
    file economy.dta saved

    . list

         +----------------------+
         | id    name   economy |
         |----------------------|
      1. |  1    John        40 |
      2. |  2   Chris        80 |
      3. |  3    Jack        90 |
      4. |  4   Huang        70 |
      5. |  5     Tom        53 |
         +----------------------+
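The session above was typed interactively. The same data can be entered from a do-file, as in this sketch; the clear command and the replace option on save are additions so that the script can be rerun without errors.

```stata
clear
input id str10 name economy
1 John  40
2 Chris 80
3 Jack  90
4 Huang 70
5 Tom   53
end

list
save economy, replace
```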
3-6 Importing data in other formats: insheet
    insheet using 3origin.csv, clear
When a variable has very many digits, or the import must preserve high precision, add the double option to the command.
    insheet using 3origin.txt, double clear
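From Stata 13 onward, insheet is superseded by import delimited, although insheet still runs. A sketch of the equivalent calls, assuming the same two files as above; asdouble is the import delimited counterpart of the double option.

```stata
* Comma-separated file
import delimited using "3origin.csv", clear

* Text file, storing floating-point data as double
import delimited using "3origin.txt", asdouble clear
```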
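3-7 Creating a folder and changing the working directory: mkdir and cd
Suppose you want a new folder named mydata in the root of drive D to hold data files; the command is mkdir. If the folder already exists, mkdir stops with an error. Prefixing it with capture suppresses that error, so the command is effectively skipped when mydata exists and creates the folder when it does not.
    . capture mkdir D:/mydata
    . cd d:/mydata
    d:\mydata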
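After a captured command, the return code is available in _rc, which allows a do-file to check what happened. A minimal sketch, assuming a Windows machine with a D: drive as in the example above.

```stata
capture mkdir "D:/mydata"
* _rc is 0 if the folder was created and nonzero if mkdir failed,
* for example because the folder already exists
display _rc

cd "D:/mydata"
pwd
```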