Apache Spark StringIndexerModel returns an empty dataset after transforming one particular column. I am using the Adult dataset: http://mlr.cs.umass.edu/ml/datasets/Adult
Step 1: Create the StringIndexerModel and save it locally
StringIndexerModel model = new StringIndexer()
        .setInputCol(column)
        .setOutputCol("label")
        .setHandleInvalid("skip")
        .setStringOrderType("alphabetAsc")
        .fit(originalDataset);
model.write().save(filelocation);
Step 2: Load the saved indexer model and transform the new dataset
StringIndexerModel model = StringIndexerModel.read().load(filelocation);
newDataset = model.transform(newDataset).drop(column).withColumnRenamed("label", column);
New dataset:
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|age|capital gain|capital loss|education |education num|fnlgwt|hours per week|marital status |native country|occupation |race |relationship |sex |workclass |
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|39 |2174 |0 | Bachelors|13 |77516 |40 | Never-married | United-States| Adm-clerical |White| Not-in-family|Male| State-gov |
|50 |0 |0 | Bachelors|13 |83311 |13 | Married-civ-spouse| United-States| Exec-managerial|White| Husband |Male| Self-emp-not-inc|
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
Correct output:
Column: education | File Location: localFolder/stringIndex/education
Labels: [ 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college]
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|marital status |native country|occupation |race |relationship |sex |workclass |education|
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|39 |2174 |0 |13 |77516 |40 | Never-married | United-States| Adm-clerical |White| Not-in-family|Male| State-gov |9.0 |
|50 |0 |0 |13 |83311 |13 | Married-civ-spouse| United-States| Exec-managerial|White| Husband |Male| Self-emp-not-inc|9.0 |
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
Column: marital status | File Location: localFolder/stringIndex/marital status
Labels: [ Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed]
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|native country|occupation |race |relationship |sex |workclass |education|marital status|
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|39 |2174 |0 |13 |77516 |40 | United-States| Adm-clerical |White| Not-in-family|Male| State-gov |9.0 |4.0 |
|50 |0 |0 |13 |83311 |13 | United-States| Exec-managerial|White| Husband |Male| Self-emp-not-inc|9.0 |2.0 |
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
Column: native country | File Location: localFolder/stringIndex/native country
Labels: [ ?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia]
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|occupation |race |relationship |sex |workclass |education|marital status|native country|
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|39 |2174 |0 |13 |77516 |40 | Adm-clerical |White| Not-in-family|Male| State-gov |9.0 |4.0 |39.0 |
|50 |0 |0 |13 |83311 |13 | Exec-managerial|White| Husband |Male| Self-emp-not-inc|9.0 |2.0 |39.0 |
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
Column: occupation | File Location: localFolder/stringIndex/occupation
Labels: [ ?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving]
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|race |relationship |sex |workclass |education|marital status|native country|occupation|
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|39 |2174 |0 |13 |77516 |40 |White| Not-in-family|Male| State-gov |9.0 |4.0 |39.0 |1.0 |
|50 |0 |0 |13 |83311 |13 |White| Husband |Male| Self-emp-not-inc|9.0 |2.0 |39.0 |4.0 |
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
Incorrect output: every other model works correctly; only this one returns an empty dataset.
Column: race | File Location: localFolder/stringIndex/race
Labels: [ Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White]
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|relationship|sex|workclass|education|marital status|native country|occupation|race|
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
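One thing that may be worth checking: with setHandleInvalid("skip"), transform() drops every row whose value does not exactly match one of the fitted labels. In the new dataset shown above, race is printed as "White" with no leading space, while education is printed as " Bachelors" with one. A minimal sketch of such a check in plain Java follows. The class name UnseenLabelCheck and the method unseenValues are hypothetical helpers, and the label strings with leading spaces are an assumption about what the fitted model contains. In practice the labels would come from the loaded model and the values from the distinct entries of the column being transformed.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UnseenLabelCheck {

    // Returns the column values that do not exactly match any fitted label.
    // With handleInvalid = "skip", rows holding such values are dropped by transform().
    public static Set<String> unseenValues(String[] fittedLabels, List<String> columnValues) {
        Set<String> known = new LinkedHashSet<>(Arrays.asList(fittedLabels));
        Set<String> unseen = new LinkedHashSet<>();
        for (String value : columnValues) {
            if (!known.contains(value)) {
                unseen.add(value);
            }
        }
        return unseen;
    }

    public static void main(String[] args) {
        // Assumption: the fitted labels carry a leading space, as in the raw Adult data file.
        String[] labels = {
            " Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White"
        };
        // Values as they appear in the race column of the new dataset shown above.
        List<String> values = Arrays.asList("White", "White");

        // Brackets make leading or trailing whitespace visible in the output.
        for (String value : unseenValues(labels, values)) {
            System.out.println("[" + value + "]");
        }
    }
}
```

If this check reports values for the race column, the empty result would be explained by the skip setting, and normalizing whitespace consistently in both the training data and the new data (for example by trimming the column before fitting and before transforming) might resolve it.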