Project author: hpcc-systems

Project description:
HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer
Language: ECL
Repository: git://github.com/hpcc-systems/DataPatterns.git
Created: 2016-09-29T14:48:25Z
Project community: https://github.com/hpcc-systems/DataPatterns


DataPatterns

DataPatterns is an ECL bundle that provides data profiling and
research tools to an ECL programmer.

Installation

Note: Profile(), BestRecordStructure() and Benford() are
now included in the HPCC Systems platform! They have been added to the ECL
Standard Library (within Std.DataPatterns) and Profile() has also been
integrated within ECL Watch so you can create a profile from a saved logical file
using only a web browser. Note that the Std library version of Profile() will
create a visualization of the results only when executed from ECL Watch;
visualizations will not be generated if Std.DataPatterns.Profile() is
called from ECL code. If that is important to you, install this bundle
version instead (they coexist peacefully).
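
For reference, here is a minimal sketch of calling the Standard Library version
(this assumes a platform release that includes Std.DataPatterns, as described
above; the file path is a placeholder):

    IMPORT Std;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    // Same interface as the bundle version, but no visualization is
    // generated when this is called from ECL code (see the note above)
    profileResults := Std.DataPatterns.Profile(ds);

    OUTPUT(profileResults, ALL, NAMED('profileResults'));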

This code is installed as an ECL Bundle. Complete instructions for managing ECL
Bundles can be found in The ECL IDE and HPCC Client Tools documentation.

Use the ecl command line tool to install this bundle:

    ecl bundle install https://github.com/hpcc-systems/DataPatterns.git

You may have to either navigate to the client tools bin directory before
executing the command, or use the full path to the ecl tool.

After installation, all of the code here becomes available after you import it:

    IMPORT DataPatterns;

Note that it is possible to use this code without installing it as a bundle. To do
so, simply make it available within your IDE and just ignore the Bundle.ecl
file. With the Windows IDE, the DataPatterns directory must not be a top-level
item in your repository list; it needs to be installed one level below the top
level, such as within your “My Files” folder.

Release Notes



|Version|Notes|
|:---:|:---|
|1.0.0|Initial public release, finally with support for datasets defined using dynamic record lookup|
|1.0.1|Add ProfileFromPath and BestRecordStructureFromPath; ave_length bug fix|
|1.0.2|Change attribute field in CorrelationsRec embedded dataset to STRING|
|1.1.0|Add record count breakdown for low-cardinality field values; ProfileFromPath() returns correct record structure|
|1.1.1|Examine UTF8 values for alternate best_attribute_type data types rather than just passing them through|
|1.2.0|Add option to emit a suitable TRANSFORM function to BestRecordStructure and BestRecordStructureFromPath|
|1.2.1|Just-sprayed CSV files now supported within BestRecordStructureFromPath|
|1.2.2|Bug fix: Support datasets that contain reserved words as field names (e.g. loop)|
|1.3.0|Support for embedded child records; bug fix for proper computing of upper quartile value|
|1.3.1|Just-sprayed CSV files now supported within ProfileFromPath|
|1.3.2|Allow most CSV attributes to acquire default values in ProfileFromPath and BestRecordStructureFromPath|
|1.3.3|Add file kind gathering back to the code in ProfileFromPath and BestRecordStructureFromPath (regression from 1.3.2)|
|1.3.4|When given explicit numeric attribute types, refrain from recommending a “best” attribute type|
|1.3.5|Fix ordering of output in BestRecordStructure when TRANSFORM is emitted|
|1.4.0|Automatically include improved visual results of Profile, including data distribution graphs (within workunit’s Resources tab)|
|1.4.1|Regression: Fix self-tests that were failing due to changes in v1.3.4|
|1.4.2|String fields containing all numerics with leading zeros are now marked as string in best_attribute_type; string fields where the length varies by more than three orders of magnitude are now marked as string in best_attribute_type|
|1.5.0|Add support for SET OF data types and child datasets|
|1.5.1|Support for tabbed visual results of multiple profiles in a workunit’s result; changes to avoid symbol collision in calling ECL code; visual report styling update|
|1.5.2|Import the ECL Standard Library within the Profile() function macro so callers do not have to|
|1.5.3|Fix leading-zero numeric test, ensuring that only all-numeric values are considered as string type candidates|
|1.5.4|Fix tab issues that appeared when multiple profiling results were available|
|1.5.5|Fix visualized report vertical scrolling problems; update dependency to resolve security issue; removed erroneous HTML fragment from reports|
|1.5.7|Add NormalizeProfileResults() function macro (see below for details); fix ECL compiler problem accessing child datasets hosted within embedded child records; make sure empty child dataset information appears in the final output|
|1.6.0|is_numeric result is now based upon best_attribute_type rather than given_attribute_type, and the numeric_xxxx results will appear for those attributes as well; renamed numeric_correlations result to simply correlations|
|1.6.1|Fix problem where large datasets with implicit numeric conversions ran out of memory during the final phase of profiling|
|1.6.2|Fix issue where a record definition END would appear in the wrong place within BestRecordStructure(); remove BestRecordStructureFromPath() and ProfileFromPath() — they never worked in all circumstances|
|1.6.3|Fix issue where fields in the NewLayout record definition emitted by BestRecordStructure were out of order|
|1.6.4|Bump visualizer code, including dependencies, to latest versions; increase default lcbLimit value to 1000|
|1.6.5|Significant (~75%) performance boost within the text pattern code — thanks to Manjunath Venkataswamy for finding the issue|
|1.7.0|NormalizeProfileResults() now shows results for attributes within child datasets (text patterns, correlations, etc); addition of Benford() analysis function; add workaround to allow a child dataset to be cited in a fieldListStr argument in Profile()|
|1.7.1|Fix digit selection code in Benford|
|1.7.2|Benford: Recognize implied trailing zeros after a decimal point|
|1.8.0|Addition of Validation module; minor optimization in text pattern generation|
|1.8.1|Fix issue with correlation with a numeric field named ‘row’|
|1.8.2|Security: Bump Viz Versions|
|1.9.0|New functionality: Cardinality() function; improve handling of specific child dataset fields in fieldListStr parameter; security updates|
|1.9.1|Fix IMPORT in (Profile) Tests module; support UTF-8 strings in Mode values and example text patterns|
|1.9.2|Security updates|
|1.9.3|Better identify upper- and lower-case Unicode characters in text patterns; scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead|
|1.9.4|README fixes and updates; improve UTF-8 detection and prevent buffer overruns during character scans; use short form of Unicode property names in regex|
|1.9.5|Correct Unicode regex regression introduced in 1.9.4|
|1.10.0|Security in visualization; expand “record count” fields from UNSIGNED4 to UNSIGNED6 — thanks to Manjunath Venkataswamy for requesting this improvement; add UTF8-specific TRIM and regex calls to avoid casting if possible|

Profile

Documentation as pulled from the beginning of Profile.ecl:

    Profile() is a function macro for profiling all or part of a dataset.
    The output is a dataset containing the following information for each
    profiled attribute:

        attribute               The name of the attribute
        given_attribute_type    The ECL type of the attribute as it was defined
                                in the input dataset
        best_attribute_type     An ECL data type that both allows all values
                                in the input dataset and consumes the least
                                amount of memory
        rec_count               The number of records analyzed in the dataset;
                                this may be fewer than the total number of
                                records, if the optional sampleSize argument
                                was provided with a value less than 100
        fill_count              The number of rec_count records containing
                                non-nil values; a 'nil value' is an empty
                                string, a numeric zero, or an empty SET; note
                                that BOOLEAN attributes are always counted as
                                filled, regardless of their value; also,
                                fixed-length DATA attributes (e.g. DATA10) are
                                also counted as filled, given their typical
                                function of holding data blobs
        fill_rate               The percentage of rec_count records containing
                                non-nil values; this is basically
                                fill_count / rec_count * 100
        cardinality             The number of unique, non-nil values within
                                the attribute
        cardinality_breakdown   For those attributes with a low number of
                                unique, non-nil values, show each value and the
                                number of records containing that value; the
                                lcbLimit parameter governs what "low number"
                                means
        modes                   The most common values in the attribute, after
                                coercing all values to STRING, along with the
                                number of records in which the values were
                                found; if no value is repeated more than once
                                then no mode will be shown; up to five (5)
                                modes will be shown; note that string values
                                longer than the maxPatternLen argument will
                                be truncated
        min_length              For SET datatypes, the fewest number of elements
                                found in the set; for other data types, the
                                shortest length of a value when expressed
                                as a string; null values are ignored
        max_length              For SET datatypes, the largest number of elements
                                found in the set; for other data types, the
                                longest length of a value when expressed
                                as a string; null values are ignored
        ave_length              For SET datatypes, the average number of elements
                                found in the set; for other data types, the
                                average length of a value when expressed
                                as a string
        popular_patterns        The most common patterns of values; see below
        rare_patterns           The least common patterns of values; see below
        is_numeric              Boolean indicating if the original attribute
                                was a numeric scalar or if the best_attribute_type
                                value was a numeric scalar; if TRUE then the
                                numeric_xxxx output fields will be
                                populated with actual values; if this value
                                is FALSE then all numeric_xxxx output values
                                should be ignored
        numeric_min             The smallest non-nil value found within the
                                attribute as a DECIMAL; this value is valid only
                                if is_numeric is TRUE; if is_numeric is FALSE
                                then zero will show here
        numeric_max             The largest non-nil value found within the
                                attribute as a DECIMAL; this value is valid only
                                if is_numeric is TRUE; if is_numeric is FALSE
                                then zero will show here
        numeric_mean            The mean (average) non-nil value found within
                                the attribute as a DECIMAL; this value is valid
                                only if is_numeric is TRUE; if is_numeric is
                                FALSE then zero will show here
        numeric_std_dev         The standard deviation of the non-nil values
                                in the attribute as a DECIMAL; this value is
                                valid only if is_numeric is TRUE; if is_numeric
                                is FALSE then zero will show here
        numeric_lower_quartile  The value separating the first (bottom) and
                                second quarters of non-nil values within
                                the attribute as a DECIMAL; this value is valid
                                only if is_numeric is TRUE; if is_numeric is
                                FALSE then zero will show here
        numeric_median          The median non-nil value within the attribute
                                as a DECIMAL; this value is valid only
                                if is_numeric is TRUE; if is_numeric is FALSE
                                then zero will show here
        numeric_upper_quartile  The value separating the third and fourth
                                (top) quarters of non-nil values within
                                the attribute as a DECIMAL; this value is valid
                                only if is_numeric is TRUE; if is_numeric is
                                FALSE then zero will show here
        correlations            A child dataset containing correlation values
                                comparing the current numeric attribute with all
                                other numeric attributes, listed in descending
                                correlation value order; the attribute must be
                                a numeric ECL datatype; non-numeric attributes
                                will return an empty child dataset; note that
                                this can be a time-consuming operation,
                                depending on the number of numeric attributes
                                in your dataset and the number of rows (if you
                                have N numeric attributes, then
                                N * (N - 1) / 2 calculations are performed,
                                each scanning all data rows)

    Most profile outputs can be disabled. See the 'features' argument, below.

    Data patterns can give you an idea of what your data looks like when it is
    expressed as a (human-readable) string. The function converts each
    character of the string into a fixed character palette to produce a "data
    pattern" and then counts the number of unique patterns for that attribute.
    The most- and least-popular patterns from the data will be shown in the
    output, along with the number of times that pattern appears and an example
    (randomly chosen from the actual data). The character palette used is:

        A   Any uppercase letter
        a   Any lowercase letter
        9   Any numeric digit
        B   A boolean value (true or false)

    All other characters are left as-is in the pattern.

    Function parameters:

        @param inFile         The dataset to process; this could be a child
                              dataset (e.g. inFile.childDS); REQUIRED
        @param fieldListStr   A string containing a comma-delimited list of
                              attribute names to process; use an empty string to
                              process all attributes in inFile; OPTIONAL,
                              defaults to an empty string
        @param maxPatterns    The maximum number of patterns (both popular and
                              rare) to return for each attribute; OPTIONAL,
                              defaults to 100
        @param maxPatternLen  The maximum length of a pattern; longer patterns
                              are truncated in the output; this value is also
                              used to set the maximum length of the data to
                              consider when finding cardinality and mode values;
                              must be 33 or larger; OPTIONAL, defaults to 100
        @param features       A comma-delimited string listing the profiling
                              elements to be included in the output; OPTIONAL,
                              defaults to a comma-delimited string containing all
                              of the available keywords:

                                  KEYWORD                 AFFECTED OUTPUT
                                  fill_rate               fill_rate
                                                          fill_count
                                  cardinality             cardinality
                                  cardinality_breakdown   cardinality_breakdown
                                  best_ecl_types          best_attribute_type
                                  modes                   modes
                                  lengths                 min_length
                                                          max_length
                                                          ave_length
                                  patterns                popular_patterns
                                                          rare_patterns
                                  min_max                 numeric_min
                                                          numeric_max
                                  mean                    numeric_mean
                                  std_dev                 numeric_std_dev
                                  quartiles               numeric_lower_quartile
                                                          numeric_median
                                                          numeric_upper_quartile
                                  correlations            correlations

                              To omit the output associated with a single keyword,
                              set this argument to a comma-delimited string
                              containing all other keywords; note that the
                              is_numeric output will appear only if min_max,
                              mean, std_dev, quartiles, or correlations features
                              are active; also note that enabling the
                              cardinality_breakdown feature will also enable
                              the cardinality feature, even if it is not
                              explicitly enabled
        @param sampleSize     A positive integer representing a percentage of
                              inFile to examine, which is useful when analyzing a
                              very large dataset and only an estimated data
                              profile is sufficient; valid range for this
                              argument is 1-100; values outside of this range
                              will be clamped; OPTIONAL, defaults to 100 (which
                              indicates that the entire dataset will be analyzed)
        @param lcbLimit       A positive integer (<= 1000) indicating the maximum
                              cardinality allowed for an attribute in order to
                              emit a breakdown of the attribute's values; this
                              parameter will be ignored if cardinality_breakdown
                              is not included in the features argument; OPTIONAL,
                              defaults to 64

Here is a very simple example of executing the full data profiling code:

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    profileResults := DataPatterns.Profile(ds);

    OUTPUT(profileResults, ALL, NAMED('profileResults'));
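
If only part of the profile is needed, the optional arguments documented above can
be supplied as well. The following is a sketch rather than an example from the
bundle itself; it assumes named-parameter calls work here as they do for
Validate() later in this document, and the field names are placeholders:

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    // Profile only two fields, skip the relatively expensive correlations
    // feature, and examine a 10% sample of the records
    profileResults := DataPatterns.Profile
        (
            ds,
            fieldListStr := 'field1,field2',
            features := 'fill_rate,cardinality,best_ecl_types,lengths,patterns,modes,min_max,mean,std_dev,quartiles',
            sampleSize := 10
        );

    OUTPUT(profileResults, ALL, NAMED('profileResults'));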

Profile(): Summary Report with Graphs

A report is generated based on the output of Profile(). The report is
accessible via a Workunit’s Resources tab within ECL Watch. For example:

Screen capture displaying active Resources tab

Every attribute in the Profile result is represented by a row of information.
Each row of information is organized into several columns. Here is a short description
of each column:

  1. Type information, Cardinality Count & Filled Count
  2. Min, Avg, Max Length (for string attributes) or Mean, Std. Deviation, Quartiles (for numeric attributes)
  3. Quartile bell curve and candlestick
    • only shown for attributes with is_numeric === true
    • this column is omitted if the above condition fails for all attributes
  4. Cardinality Breakdown listed by count descending
    • only shown for attributes with cardinality_breakdown content
    • this column is omitted if the above condition fails for all attributes
  5. Popular Patterns
    • only shown for attributes with popular_patterns content
    • this column is omitted if the above condition fails for all attributes

This is a screen capture displaying a report row for a string attribute
(“Test_Name”) and a numeric attribute (“Test_Score”):

Screen capture of two report rows


NormalizeProfileResults

The result of a call to Profile is a rich dataset.
There are several fields (depending on the features requested) and some
of them can include child datasets embedded for each field from the dataset
being profiled.

In some circumstances, it would be advantageous to save the profile results
in a more normalized format. For instance, a normalized format would allow
the task of comparing one profile result to another to be much easier.

NormalizeProfileResults accepts only one argument: the dataset representing
the result of a call to Profile(). The result is a dataset in the following
format:

    RECORD
        STRING attribute;   // Field from profiled dataset
        STRING key;         // Field from profile results
        STRING value;       // Value from profile results
    END;

Some profile results are represented with embedded child datasets (modes,
cardinality breakdowns, text patterns, and correlations). When normalizing,
portions of these child datasets are converted to string values delimited
by the ‘|’ character. If records within the child dataset contain
additional information, such as a record count, the additional information
is delimited with a ‘:’ character.

Sample code:

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    profileResults := DataPatterns.Profile(ds);
    normalizedResults := DataPatterns.NormalizeProfileResults(profileResults);

    OUTPUT(normalizedResults, ALL, NAMED('normalizedResults'));

profileResults:

|attribute|given_attribute_type|rec_count|fill_count|fill_rate|popular_patterns (data_pattern : rec_count)|
|:---|:---|:---|:---|:---|:---|
|field1|string|1000|1000|100|AAAAAA : 10, AAA : 5|

normalizedResults:

|attribute|key|value|
|:---|:---|:---|
|field1|given_attribute_type|string|
|field1|rec_count|1000|
|field1|fill_count|1000|
|field1|fill_rate|100|
|field1|popular_patterns|AAAAAA:10\|AAA:5|
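
As a sketch of the comparison use case mentioned above (the file paths are
placeholders and this JOIN layout is just one way to do it), two normalized
profiles can be matched on attribute and key and filtered down to the entries
whose values differ:

    IMPORT DataPatterns;

    // Hypothetical: two versions of the same logical file
    dsOld := DATASET('~thor::my_sample_data_v1', RECORDOF('~thor::my_sample_data_v1', LOOKUP), FLAT);
    dsNew := DATASET('~thor::my_sample_data_v2', RECORDOF('~thor::my_sample_data_v2', LOOKUP), FLAT);

    normOld := DataPatterns.NormalizeProfileResults(DataPatterns.Profile(dsOld));
    normNew := DataPatterns.NormalizeProfileResults(DataPatterns.Profile(dsNew));

    // Pair matching attribute/key entries and keep only those whose values changed
    diffs := JOIN
        (
            normOld,
            normNew,
            LEFT.attribute = RIGHT.attribute AND LEFT.key = RIGHT.key,
            TRANSFORM
                (
                    {STRING attribute, STRING key, STRING old_value, STRING new_value},
                    SELF.old_value := LEFT.value,
                    SELF.new_value := RIGHT.value,
                    SELF := LEFT
                )
        )(old_value <> new_value);

    OUTPUT(diffs, NAMED('profileDifferences'), ALL);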

BestRecordStructure

This is a function macro that, given a dataset, returns a recordset containing
the “best” record definition for the given dataset. By default, the entire
dataset will be examined. You can override this behavior by providing a
percentage of the dataset to examine (1-100) as the second argument. This is
useful if you are checking a very large file and are confident that a sample
will provide correct results.

Sample call:

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    recordDefinition := DataPatterns.BestRecordStructure(ds);

    OUTPUT(recordDefinition, NAMED('recordDefinition'), ALL);

The result will be a recordset containing only a STRING field. The first
record will always contain ‘RECORD’ and the last record will always contain
‘END;’. The records in between will contain declarations for the attributes
found within the given dataset. The entire result can be copied and pasted
into an ECL code module.

Note that, when outputting the result of BestRecordStructure to a workunit,
it is a good idea to add an ALL flag to the OUTPUT function. This ensures that
all attributes will be displayed. Otherwise, if you have more than 100
attributes in the given dataset, the result will be truncated.
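
For a very large file, here is a sketch of the sampling form described above
(the second argument is the percentage of the dataset to examine):

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    // Derive the "best" structure from a 10% sample of the records
    recordDefinition := DataPatterns.BestRecordStructure(ds, 10);

    OUTPUT(recordDefinition, NAMED('recordDefinition'), ALL);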


Cardinality

A portion of Profile() deals with cardinality. If there is a low-enough number of
unique values within an attribute, Profile() will automatically show those values
along with the count of the number of records with each value. But what if you’re
really interested in those values and want to see them all? No matter how many
there are? Enter the Cardinality() function macro.

Cardinality() finds all the unique values in one or more fields and displays the
count of the number of records for each value, without limitation on the number of
fields or the number of found values. The result is a simple three-field dataset:

    STRING attribute;
    UTF8 value;
    UNSIGNED8 rec_count;

The only required parameter to Cardinality() is a dataset to process. You can
optionally provide a comma-delimited string naming specific fields, if you don’t
want to process all of the fields. You can also limit the analysis to only a portion
of the dataset (though that is probably of limited usefulness).

Sample call:

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    cardinalityResults := DataPatterns.Cardinality(ds);

    OUTPUT(cardinalityResults, NAMED('cardinalityResults'));
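
To restrict the analysis to particular fields, a comma-delimited field list can be
supplied as described above. A sketch, assuming the field list is accepted as the
second argument and using placeholder field names:

    IMPORT DataPatterns;

    filePath := '~thor::my_sample_data';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    // Count unique values for just these two fields
    cardinalityResults := DataPatterns.Cardinality(ds, 'field1,field2');

    OUTPUT(cardinalityResults, NAMED('cardinalityResults'));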

See the comments at the beginning of the Cardinality.ecl file
for more details.


Data Validation Submodule

Validation exists as a submodule within DataPatterns. It contains two function
macros: Validate() and Fix().

Validate() provides an easy mechanism for testing expected field values at
the record level, then appending those test results to each record in a
standardized layout. Tests are named, and associated with each test is
a bit of ECL that defines what a valid field should look like. Fields with
values that do not pass that test are flagged.

Fix() is the other half of the testing: Once you have the output from
Validate() you will need to handle the failing field values somehow. The
Fix() function macro processes records with failures and gives you the
opportunity to correct the error or to omit the record entirely.

Validation.Validate()

Documentation as pulled from Validation.ecl:

Validation checks are defined within a semicolon-delimited STRING. Each check
should be in the following format:

    <test_name>:<test_ecl>

test_name should be a name somehow representing the check that is
being performed. The name will be included in the appended data if the
check fails. This name should clearly (but succinctly) describe what is
being tested. There is no requirement for a test_name to be unique
(and there are some use cases where you may not want it to be unique at all) but,
in general, the name should be unique within a single Validate() call.
Names should start with a letter and may contain letters, numbers, periods,
dashes, and underscores.

test_ecl is ECL code that performs the test. If a string literal is
included in the test then the apostrophes must be escaped because the test
is being defined within a string. If a REGEXFIND() or REGEXREPLACE()
function is used and anything within the pattern needs to be escaped then
the backslash must be double-escaped. ECL already requires a single escape
(\\. or \\d) but including it in a test here means you have to
double-escape the backslash: \\\\. or \\\\d.
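
As a sketch of these escaping rules (the test names and fields are placeholders):

    // A string literal inside a check needs escaped apostrophes; a literal
    // backslash inside a regex pattern needs to be double-escaped
    checks := 'ExchangeIsAA:exchange_code = \'AA\''
            + '; ZipFormat:REGEXFIND(\'^\\\\d{5}$\', zip_code)'
            ;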

The ECL code used during the test is executed within the scope of a single
dataset record. Syntax-wise, it is similar to creating an ECL filter clause.
Like a filter, the ECL should evaluate to a BOOLEAN result and what you want
to do is return TRUE if the data being tested is valid. Invalid results,
where the ECL returns FALSE, are what is appended to the dataset.

Validate() imports the Std ECL library, so all standard library functions
are available for use within a test. Also, because Validate() is a function
macro, any function that is in scope when Validate() is called may also be
used within a test. This provides quite a bit of flexibility when it comes
to writing tests. The example code below references StartsWithAA() which
is an example of one of these user-supplied tests.

Validate() also includes a few internally-defined functions for use within
your tests as a convenience. Some are coercion functions that alter a field’s
value, others are test functions. These tests are not available for use in
your own custom, externally-defined tests.

Coercion helpers:

  1. OnlyDigits(s) Convert a single argument to a string and remove
  2. everything but numeric digits; returns a STRING
  3. OnlyChars(s) Convert a single argument to a UTF-8 string and remove
  4. everything but alphabetic characters; returns a
  5. UTF8 string
  6. WithoutPunct(s) Convert a single argument to a UTF-8 string and remove
  7. all punctuation characters; returns a UTF8 string
  8. Patternize(s) Create a 'text pattern' from the single argument,
  9. mapping character classes to a fixed palette:
  10. lowercase character -> a
  11. uppercase character -> A
  12. numeric digit -> 9
  13. everything else -> unchanged
  14. The result is returned as a UTF8 string

Value testing helpers:

    StrLen(s)             Convert a single argument to a UTF-8 string and return
                          its length as an unsigned integer
    IsOnlyDigits(s)       Return TRUE if every character in the value is a digit
    IsOnlyUppercase(s)    Return TRUE if every character in the value is an
                          uppercase character
    IsOnlyLowercase(s)    Return TRUE if every character in the value is a
                          lowercase character
    IsDecimalNumber(s)    Return TRUE if the value is a number, possibly prefixed
                          by a negative sign, and possibly including a decimal
                          portion

Record-level testing helpers:

    AllFieldsFilled()     Tests every top-level field in the record by coercing
                          the values to STRING and seeing if any of them are empty;
                          returns TRUE if no field value is an empty string; note
                          that this function accepts no argument

Example test specifications:

    MyValueIsPos:my_value > 0                    // my_value must be greater than zero
    SomeNumInRange:some_num BETWEEN 50 AND 100   // some_num must be 50..100
    FIPSLength:StrLen(fips) = 5                  // length of FIPS code must be 5
    DatesOrdered:dateBegin <= dateEnd            // make sure dates are not flipped

Here is a complete example:

    IMPORT DataPatterns;

    filePath := '~thor::stock_data.txt';

    DataRec := RECORD
        STRING trade_date;
        STRING exchange_code;
        STRING stock_symbol;
        STRING opening_price;
        STRING high_price;
        STRING low_price;
        STRING closing_price;
        STRING shares_traded;
        STRING share_value;
    END;

    ds := DATASET(filePath, DataRec, CSV(SEPARATOR('\t'), HEADING(1)));

    // Custom, external field validation functions
    StartsWithAA(STRING s) := s[1..2] = 'AA';
    IsValidPrice(STRING price) := NOT(REGEXFIND('^\\d+?00$', price) AND (UNSIGNED)price >= 10000);

    checks := 'NonZeroLowPrice:(REAL)low_price > 0'
            + '; NonZeroHighPrice:(REAL)high_price > 0'
            + '; LowPriceLessOrEqualToHighPrice:(REAL)low_price <= (REAL)high_price'
            + '; OpeningPriceGreaterThanOne:(REAL)opening_price > 1'
            + '; OpeningPriceFormat:REGEXFIND(U8\'9+(\\\\.9{1,2})?\', Patternize(opening_price))'
            + '; OpeningPriceValid:IsValidPrice(opening_price)'
            + '; ClosingPriceValid:IsValidPrice(closing_price)'
            + '; SymbolStartsWithAA:StartsWithAA(stock_symbol)'
            + '; EveryFieldPresent:AllFieldsFilled()'
            ;

    validationResult := DataPatterns.Validation.Validate(ds, specStr := checks);

    OUTPUT(validationResult, {validationResult}, '~thor::stock_data_validated', OVERWRITE, COMPRESSED);

Validation.Fix()

Fixes are defined within a semicolon-delimited STRING. Each fix should
be in the following format:

    <membership_test>:<fix_ecl>

membership_test is a logical clause testing whether one or more tests
from the Validate() function is true for that record. The entries here
correspond to the test_name entries from the Validate() function and
they can optionally form a boolean expression using AND and OR operators.
At its simplest, a membership_test is just a single test_name entry and
it will be interpreted as the following ECL:

    ('test_name' IN validation_results.violations)

More complex boolean expressions will use that as the basis. For instance,
testing for “test_name_1 OR test_name_2” — meaning, if either of the two
validation checks failed, execute the fix_ecl code — would be interpreted as the
following ECL:

    (('test_name_1' IN validation_results.violations)
        OR
     ('test_name_2' IN validation_results.violations))

The NOT() operator is also available, so testing for the absence of a
validation is supported.
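
For example, a membership test written as “NOT(test_name_1) AND test_name_2” would
presumably be interpreted along the same lines:

    (NOT('test_name_1' IN validation_results.violations)
        AND
     ('test_name_2' IN validation_results.violations))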

fix_ecl is ECL code that fixes the problem. The most basic fix is
redefining a field value (e.g. my_field := new_value_expression).
If a string literal is included in the fix then the apostrophes must be
escaped because it is being defined within a string. If a REGEXFIND()
or REGEXREPLACE() function is used and anything within the pattern needs
to be escaped then the backslash must be double-escaped. ECL already
requires a single escape (\\. or \\d) but including it in a test here
means you have to double-escape the backslash: \\\\. or \\\\d.

The ECL code used during the fix is executed within the scope of a single
dataset record. This means that the expression may reference any field
in the record. There is no need to include SELF or LEFT scoping prefixes
when citing a dataset field name.

Fix() imports the Std ECL library, so all standard library functions
are available for use within a fix. Also, because Fix() is a function
macro, any function that is in scope when Fix() is called may also be
used within a fix.
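
As a sketch of leaning on the Standard Library inside a fix (the membership test
name, the field, and the use of Std.Str.FilterOut here are illustrative, not taken
from the bundle's own examples):

    // Strip '-' and '/' characters from trade_date for records that failed a
    // hypothetical TradeDateFormat check; note the escaped apostrophes around
    // the string literal inside the spec string
    repairs := 'TradeDateFormat:trade_date := Std.Str.FilterOut(trade_date, \'-/\')';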

Fix() also includes a few internally-defined functions for use within
your fixes as a convenience:

    OnlyDigits(s)       Convert a single argument to a UTF-8 string and remove
                        everything but numeric digits
    OnlyChars(s)        Convert a single argument to a UTF-8 string and remove
                        everything but alphabetic characters
    WithoutPunct(s)     Convert a single argument to a UTF-8 string and remove
                        all punctuation characters
    Swap(f1, f2)        Swap the contents of two named fields
    SkipRecord()        Remove the current record from the dataset

Here is a complete example:

    IMPORT DataPatterns;

    ValRec := RECORD
        UNSIGNED2 num_violations;
        SET OF STRING violations;
    END;

    LAYOUT := RECORD
        STRING trade_date;
        STRING exchange_code;
        STRING stock_symbol;
        STRING opening_price;
        STRING high_price;
        STRING low_price;
        STRING closing_price;
        STRING shares_traded;
        STRING share_value;
        ValRec validation_results;
    END;

    ds := DATASET('~thor::stock_data_validated', LAYOUT, FLAT);

    repairs := 'LowPriceLessThanOrEqualToHighPrice:Swap(high_price, low_price)'
            + '; OpeningPriceValid AND ClosingPriceValid:SkipRecord()'
            + '; OpeningPriceGreaterThanOne:opening_price := \'2\''
            ;

    repairResults := DataPatterns.Validation.Fix(ds, specStr := repairs);

    OUTPUT(repairResults, {repairResults}, '~thor::stock_data_fixed', OVERWRITE, COMPRESSED);

Benford

Benford’s law, also called the Newcomb–Benford law, or the law of anomalous
numbers, is an observation about the frequency distribution of leading digits
in many real-life sets of numerical data.

Benford’s law doesn’t apply to every set of numbers, but it usually applies
to large sets of naturally occurring numbers with some connection like:

  • Companies’ stock market values
  • Data found in texts — like the Reader’s Digest, or a copy of Newsweek
  • Demographic data, including state and city populations
  • Income tax data
  • Mathematical tables, like logarithms
  • River drainage rates
  • Scientific data

The law usually doesn’t apply to data sets that have a stated minimum and
maximum, like interest rates or hourly wages. If numbers are assigned,
rather than naturally occurring, they will also not follow the law. Examples
of assigned numbers include: zip codes, telephone numbers and Social
Security numbers.

For more information: https://en.wikipedia.org/wiki/Benford%27s_law

Note: This function is also available in the ECL Standard Library
as Std.DataPatterns.Benford() as of HPCC version 7.12.0.

Documentation as pulled from the beginning of Benford.ecl:

    Note that when computing the distribution of the most significant digit,
    the digit zero is ignored. So for instance, the values 0100, 100, 1.0,
    0.10, and 0.00001 all have a most-significant digit of '1'. The digit
    zero is considered for all other positions.

    @param inFile         The dataset to process; REQUIRED
    @param fieldListStr   A string containing a comma-delimited list of
                          attribute names to process; note that attributes
                          listed here must be top-level attributes (not child
                          records or child datasets); use an empty string to
                          process all top-level attributes in inFile;
                          OPTIONAL, defaults to an empty string
    @param digit          The 1-based digit within the number to examine; the
                          first significant digit is '1' and it only increases;
                          OPTIONAL, defaults to 1, meaning the most-significant
                          non-zero digit
    @param sampleSize     A positive integer representing a percentage of
                          inFile to examine, which is useful when analyzing a
                          very large dataset and only an estimated data
                          analysis is sufficient; valid range for this
                          argument is 1-100; values outside of this range
                          will be clamped; OPTIONAL, defaults to 100 (which
                          indicates that all rows in the dataset will be used)

    @return A new dataset with the following record structure:

        RECORD
            STRING attribute;        // Name of data attribute examined
            DECIMAL4_1 zero;         // Percentage of rows with digit of '0'
            DECIMAL4_1 one;          // Percentage of rows with digit of '1'
            DECIMAL4_1 two;          // Percentage of rows with digit of '2'
            DECIMAL4_1 three;        // Percentage of rows with digit of '3'
            DECIMAL4_1 four;         // Percentage of rows with digit of '4'
            DECIMAL4_1 five;         // Percentage of rows with digit of '5'
            DECIMAL4_1 six;          // Percentage of rows with digit of '6'
            DECIMAL4_1 seven;        // Percentage of rows with digit of '7'
            DECIMAL4_1 eight;        // Percentage of rows with digit of '8'
            DECIMAL4_1 nine;         // Percentage of rows with digit of '9'
            DECIMAL7_3 chi_squared;  // Chi-squared "fitness test" result
            UNSIGNED8 num_values;    // Number of rows with non-zero values for this attribute
        END;

    The named digit fields (e.g. "zero" and "one" and so on) represent the
    digit found in the 'digit' position of the associated attribute. The values
    that appear there are percentages. num_values shows the number of
    non-zero values processed, and chi_squared shows the result of applying
    that test using the observed vs expected distribution values.

    The first row of the results will show the expected values for the named
    digits, with "-- EXPECTED DIGIT n --" showing as the attribute name. 'n' will
    be replaced with the value of 'digit' which indicates which digit position
    was examined.

Sample call:

    IMPORT DataPatterns;

    filePath := '~thor::stock_data_';

    DataRec := RECORD
        UNSIGNED4 trade_date;
        STRING1 exchange_code;
        STRING9 stock_symbol;
        DECIMAL9_2 opening_price;
        DECIMAL9_2 high_price;
        DECIMAL9_2 low_price;
        DECIMAL9_2 closing_price;
        UNSIGNED4 shares_traded;
        UNSIGNED4 share_value;
    END;

    ds := DATASET(filePath, DataRec, FLAT);

    // Analyze only the opening_price, closing_price, and trade_date attributes
    benfordResult := DataPatterns.Benford(ds, 'opening_price, closing_price, trade_date');

    OUTPUT(benfordResult, NAMED('benfordResult'), ALL);

The result would look something like the following:

|attribute|zero|one|two|three|four|five|six|seven|eight|nine|chi_squared|num_values|
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|-- EXPECTED DIGIT 1 --|-1|30.1|17.6|12.5|9.7|7.9|6.7|5.8|5.1|4.6|20.09|20959177|
|opening_price|-1|31.7|20|13.3|9.7|7.2|5.7|4.8|4.1|3.6|1.266|19082595|
|closing_price|-1|31.7|20|13.3|9.7|7.2|5.7|4.8|4|3.6|1.307|19083933|
|trade_date|-1|0|100|0|0|0|0|0|0|0|468.182|20959177|

The result contains the attribute name, expected and actual distributions of the digit
as a percentage, the chi-squared computation indicating how well that attribute
adheres to Benford’s Law, and the number of records actually considered.

By definition, the most-significant digit will never be zero. Therefore, when computing the
distribution of the most-significant digit, the ‘zero’ field will show -1 for all
attributes in the result.

The chi_squared column holds the result of a chi-squared test comparing the
attribute’s observed digit distribution with the expected Benford distribution; the
value shown in the “-- EXPECTED DIGIT n --” row is the critical value for that test.
If an attribute’s chi_squared value is greater than that expected (critical) value
then the attribute does not follow Benford’s Law.

In the above example, the trade_date attribute fails the chi-squared test, as 468.182 > 20.09.
This makes sense, because the data in that attribute is a date in YYYYMMDD format represented
as an unsigned integer, and the dataset contains stock data for only the past few years.
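
To examine a digit position other than the first, the documented digit parameter
can be supplied. The following is a sketch, assuming a named-parameter call works
here as it does for Validate(), and reusing the stock data file from the sample
above:

    IMPORT DataPatterns;

    filePath := '~thor::stock_data_';
    ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

    // Distribution of the second significant digit of opening_price
    secondDigitResult := DataPatterns.Benford(ds, 'opening_price', digit := 2);

    OUTPUT(secondDigitResult, NAMED('secondDigitResult'), ALL);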


Profile() Testing

The data profiling code can be easily tested with the included Tests module.
hthor or ROXIE should be used to execute the tests, simply because Thor takes a
relatively long time to execute them. Here is how you invoke the tests:

    IMPORT DataPatterns;

    EVALUATE(DataPatterns.Tests);

If the tests pass then the execution will succeed and there will be no output.
These tests may take some time to execute on Thor. They run much faster on
either hthor or ROXIE, due to the use of small inline datasets.