tsql - SQL Server - remove bad data during an ETL -
i have etl imports readings flat file. readingdate on data good, it's readinghours suspect. how flag , remove errant readings?
here's code build sample table.
use [tempdb]; set nocount on; set xact_abort on; go begin transaction; use [tempdb] go create table [dbo].[readings]( [mrid] [bigint] null, [assetid] [int] not null, [readinghours] [decimal](18, 2) null, [readingdate] [date] null, [readingorderbyreadingdate] [bigint] null, [readingorderbyreadinghours] [bigint] null ); insert [dbo].[readings]([mrid], [assetid], [readinghours], [readingdate], [readingorderbyreadingdate], [readingorderbyreadinghours]) select 1, 84019, 1227.00, '20100518 00:00:00.000', 1, 1 union select 2, 84019, 1256.00, '20100519 00:00:00.000', 2, 2 union select 3, 84019, 1395.00, '20100915 00:00:00.000', 3, 4 union select 4, 84019, 1440.00, '20100930 00:00:00.000', 4, 5 union select 5, 84019, 1490.00, '20101119 00:00:00.000', 5, 6 union select 6, 84019, 1302.00, '20110228 00:00:00.000', 6, 3 union select 7, 84019, 1602.00, '20110309 00:00:00.000', 7, 7 union select 8, 84019, 1687.00, '20110630 00:00:00.000', 8, 8 union select 9, 84019, 8935.00, '20110914 00:00:00.000', 9, 27 union select 10, 84019, 1795.00, '20111014 00:00:00.000', 10, 9 union select 11, 84019, 1894.00, '20120321 00:00:00.000', 11, 10 union select 12, 84019, 1901.00, '20120330 00:00:00.000', 12, 11 union select 13, 84019, 1919.00, '20120425 00:00:00.000', 13, 12 union select 14, 84019, 1942.00, '20120606 00:00:00.000', 14, 13 union select 15, 84019, 2615.00, '20120921 00:00:00.000', 15, 26 union select 16, 84019, 2047.00, '20121023 00:00:00.000', 16, 14 union select 17, 84019, 2057.00, '20121031 00:00:00.000', 17, 16 union select 18, 84019, 2054.00, '20121108 00:00:00.000', 18, 15 union select 19, 84019, 2070.00, '20121126 00:00:00.000', 19, 17 union select 20, 84019, 2102.00, '20130211 00:00:00.000', 20, 18 union select 21, 84019, 2106.00, '20130214 00:00:00.000', 21, 19 union select 22, 84019, 2108.00, '20130219 00:00:00.000', 22, 20 union select 23, 84019, 2120.00, '20130315 00:00:00.000', 23, 21 union select 24, 84019, 2156.00, '20130520 00:00:00.000', 24, 22 union select 25, 84019, 2162.00, '20130530 00:00:00.000', 25, 23 union select 26, 84019, 2172.00, '20130610 00:00:00.000', 26, 24 union select 27, 84019, 2213.00, '20131121 00:00:00.000', 27, 25 commit; raiserror (n'[dbo].[readings]: insert batch: 1.....done!', 10, 1) nowait; go in instance, readings on lines: 6, 9, & 15 should removed. line 6 bad, period. line 9 bad based on fact lines 10 - 14 ever increasing , , represent better data. if imported line 9, rest of data skipped. line 15 considered bad because 16 - 27 ever increasing , considered readings (and same problem line 9, if import line 15, lines 16 through 27 skipped) bad values, way know bad @ of values in group.
i've found million different ways don't work, i'm point i'm rehashing bad ideas.
can fresh ones?
in sql server 2012 can use lead , lag below. example might bit overfitted sample posted idea remove 'unreasonable deltas'. highlight bad data:
select * ( select *,readinghours-lag(readinghours) over(order readingdate) deltalag, lead(readinghours) over(order readingdate)-readinghours deltalead, lead(readinghours) over(order readingdate)-lag(readinghours) over(order readingdate) deltaleadlag readings ) not(isnull(deltalag,0) between -10 , 500) , isnull(deltaleadlag,0) between -10 , 500
Comments
Post a Comment