
Getting the missing datetimes from a dataset in Python (DataSeries)

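I searched the internet and found a similar problem in part of someone else's code, but because of my reputation I cannot comment on that person's post. I have a test dataset made up of DateTime-Value pairs. The values are recorded by a sensor every minute, but the sensor is not perfect, so I have two files covering the same period that differ in length and format, because the sensor sometimes measures twice within the same minute.

The current test dataset has no data for minute 10, so the program below should show me that this row is missing.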

import pandas as pd
import numpy as np

# testing data
data = pd.read_excel("testing.xlsx")

# Set Attribute Date to Pandas Datetime
data['Date'] = pd.to_datetime(data['Date'])
# Round Time into minutes
data['Date'] = pd.Series(data['Date']).dt.round("1min")
# Remove duplicates keeping one
data.drop_duplicates(subset ="Date", keep = "first", inplace = True)
# Change date order, because date_range increment month and not day
data['Date'] = data['Date'].dt.strftime('%d-%m-%Y %H:%M:%S')
# Data Range between initial date and final date 
date_range = pd.date_range(data['Date'][data.index[0]], data['Date'][data.index[-1]], freq='1Min')

# Transform the column (it's a string) to datetime type
dt = pd.to_datetime(data['Date'])

# create datetime index passing the datetime series
dato = pd.DatetimeIndex(dt.values)

df = pd.DataFrame(np.random.randint(1, 20, (dato.shape[0], 1)))
df.index = dato  # set index

df_missing = df.drop(df.between_time('02:12', '02:14').index)

#check for missing datetimeindex values based on reference index (with all values)
missing_dates = df.index[~df.index.isin(data.index)]

print(missing_dates)

The current output is:

DatetimeIndex(['2019-04-01 02:00:00', '2019-04-01 02:01:00',
               '2019-04-01 02:02:00', '2019-04-01 02:03:00',
               '2019-04-01 02:04:00', '2019-04-01 02:05:00',
               '2019-04-01 02:06:00', '2019-04-01 02:07:00',
               '2019-04-01 02:08:00', '2019-04-01 02:09:00',
               '2019-04-01 02:11:00', '2019-04-01 02:12:00',
               '2019-04-01 02:13:00', '2019-04-01 02:14:00',
               '2019-04-01 02:15:00', '2019-04-01 02:16:00'],
              dtype='datetime64[ns]', freq=None)

whereas it should show something like this:

 DatetimeIndex(['2019-04-01 02:10:00'],
              dtype='datetime64[ns]', freq=None)

Source of the question: Stack Overflow
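
A minimal sketch of one likely reason the check returns every timestamp instead of only the gap, assuming the reference index it is tested against is the DataFrame's default integer index. The `stamps` data and the `Value` column are hypothetical stand-ins for the spreadsheet, which is not available here.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: one reading per minute, 02:10 absent.
stamps = pd.date_range("2019-04-01 02:00", "2019-04-01 02:16", freq="min")
stamps = stamps[stamps != pd.Timestamp("2019-04-01 02:10")]
data = pd.DataFrame({"Date": stamps, "Value": range(len(stamps))})

observed = pd.DatetimeIndex(data["Date"])

# data.index is the default integer index (0, 1, 2, ...), so no timestamp
# is ever found in it and the negated mask keeps every row.
wrong = observed[~observed.isin(data.index)]

# Testing a complete minute-by-minute range against the observed timestamps
# leaves only the gap.
expected = pd.date_range(observed[0], observed[-1], freq="min")
missing = expected[~expected.isin(observed)]

print(len(wrong))
print(missing)
```

The direction of the comparison matters as well: the complete range has to be the one that is filtered, because the observed index cannot contain a timestamp it never recorded.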

is大龙 2020-03-24 00:13:33
1 answer
  • I'm not sure why you need all the re-indexing and so on; perhaps you can clarify that. Comparing against a specified date_range works fine, as shown here:

    import pandas as pd
    import numpy as np
    
    # create sample data
    df = pd.DataFrame({'t': ['2019-04-01 02:00:00', '2019-04-01 02:01:00',
                             '2019-04-01 02:02:00', '2019-04-01 02:03:00',
                             '2019-04-01 02:04:00', '2019-04-01 02:05:00',
                             '2019-04-01 02:06:00', '2019-04-01 02:07:00',
                             '2019-04-01 02:08:00', '2019-04-01 02:09:00',
                             '2019-04-01 02:11:00', '2019-04-01 02:12:00',
                             '2019-04-01 02:13:00', '2019-04-01 02:14:00',
                             '2019-04-01 02:15:00', '2019-04-01 02:16:00'],
                       'y': np.arange(16)})
    df['t'] = pd.to_datetime(df['t'])
    
    # set the time column as index
    df = df.set_index(['t'])
    
    # target: date_range 1 min steps
    tgt = pd.date_range(df.index[0], df.index[-1], freq='min')
    
    # now use .isin() and negate to get the timestamps that are missing in df
    print(tgt[~tgt.isin(df.index)])
    # DatetimeIndex(['2019-04-01 02:10:00'], dtype='datetime64[ns]', freq='T')
    

    Source of the answer: Stack Overflow

    2020-03-24 00:13:40
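
The question also mentions rounding readings to the minute and dropping repeated measurements. A possible way to combine those steps with the date_range comparison is sketched below. The helper name `missing_minutes` and the sample readings are hypothetical, and `Index.difference` is used here in place of the `.isin()` mask; it is an alternative, not what the answer itself does.

```python
import pandas as pd

def missing_minutes(frame, column="Date"):
    """Return the minute timestamps absent from frame[column]."""
    # Round to whole minutes and collapse repeated readings in the same minute.
    minutes = pd.to_datetime(frame[column]).dt.round("1min").drop_duplicates()
    observed = pd.DatetimeIndex(minutes).sort_values()
    # Full minute-by-minute range covering the observed span.
    expected = pd.date_range(observed[0], observed[-1], freq="min")
    # Set difference: expected timestamps that were never observed.
    return expected.difference(observed)

# Hypothetical readings: 02:09 was sampled twice, 02:10 not at all.
readings = pd.DataFrame({
    "Date": ["2019-04-01 02:08:02", "2019-04-01 02:09:01",
             "2019-04-01 02:09:20", "2019-04-01 02:11:00",
             "2019-04-01 02:12:03"],
    "Value": [3, 5, 5, 7, 4],
})

gaps = missing_minutes(readings)
print(gaps)
```

Note that `dt.round` moves a reading taken late in a minute into the following minute. If each reading should count toward the minute in which it was taken, `dt.floor("1min")` may be the better fit.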