]> git.immae.eu Git - github/fretlink/tap-google-sheets.git/blob - README.md
Update stitch validate json to validate with python 3.8 path (#32)
[github/fretlink/tap-google-sheets.git] / README.md
1 # tap-google-sheets
2
3 This is a [Singer](https://singer.io) tap that produces JSON-formatted data
4 following the [Singer
5 spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md).
6
7 This tap:
8
9 - Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api)
10 - Extracts the following endpoints:
11 - [File Metadata](https://developers.google.com/drive/api/v3/reference/files/get)
12 - [Spreadsheet Metadata](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
13 - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
14 - Outputs the following metadata streams:
15 - File Metadata: Name, audit/change info from Google Drive
16 - Spreadsheet Metadata: Basic metadata about the Spreadsheet: Title, Locale, URL, etc.
17 - Sheet Metadata: Title, URL, Area (max column and row), and Column Metadata
18 - Column Metadata: Column Header Name, Data type, Format
19 - Sheets Loaded: Sheet title, load date, number of rows
20 - For each Sheet:
21 - Outputs the schema for each resource (based on the column header and datatypes of row 2, the first row of data)
22 - Outputs a record for all columns that have column headers, and for each row of data
23 - Emits a Singer ACTIVATE_VERSION message after each sheet is complete. This forces hard deletes on the data downstream if fewer records are sent.
24 - Primary Key for each row in a Sheet is the Row Number: `__sdc_row`
25 - Each Row in a Sheet also includes Foreign Keys to the Spreadsheet Metadata, `__sdc_spreadsheet_id`, and Sheet Metadata, `__sdc_sheet_id`.
26
27 ## API Endpoints
28 [**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
29 - Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version
30 - Primary keys: id
31 - Replication strategy: Incremental (GET file audit data for spreadsheet_id in config)
32 - Process/Transformations: Replicate Data if Modified
33
34 [**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
35 - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
36 - This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information)
37 - Primary keys: Spreadsheet Id, Sheet Id, Column Index
38 - Foreign keys: None
39 - Replication strategy: Full (get and replace file metadata for spreadshee_id in config)
40 - Process/Transformations:
41 - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
42 - sheetId, title, index, gridProperties (rowCount, columnCount)
43 - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column name uniqueness
44 - Create/Verify Datatypes based on 2nd row value and cell metadata
45 - First check:
46 - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue)
47 - Valid types: numberValue, stringValue, boolValue
48 - Invalid types: formulaValue, errorValue
49 - Then check:
50 - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType)
51 - Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
52 - Determine JSON schema column data type based on the value and the above cell metadata settings.
53 - If DATE, DATE_TIME, or TIME, set JSON schema format accordingly
54
55 [**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
56 - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS
57 - This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption)
58 - Primary keys: _sdc_row
59 - Replication strategy: Full (GET file audit data for spreadsheet_id in config)
60 - Process/Transformations:
61 - Loop through sheets (compared to catalog selection)
62 - Send metadata for sheet
63 - Loop through ALL columns for columns having a column header
64 - Loop through ranges of rows for ALL rows in sheet available area max row (from sheet metadata)
65 - Transform values, if necessary (dates, date-times, times, boolean).
66 - Date/time serial numbers converted to date, date-time, and time strings. Google Sheets uses Lotus 1-2-3 [Serial Number](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) format for date/times. These are converted to normal UTC date-time strings.
67 - Process/send records to target
68
69 ## Authentication
70 The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.
71 - Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly
72 - Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly
73 - Tap config.json parameters:
74 - client_id: identifies your application
75 - client_secret: authenticates your application
76 - refresh_token: generates an access token to authorize your session
77 - spreadsheet_id: unique identifier for each spreadsheet in Google Drive
78 - start_date: absolute minimum start date to check file modified
79 - user_agent: tap-name and email address; identifies your application in the Remote API server logs
80
81 ## Quick Start
82
83 1. Install
84
85 Clone this repository, and then install using setup.py. We recommend using a virtualenv:
86
87 ```bash
88 > virtualenv -p python3 venv
89 > source venv/bin/activate
90 > python setup.py install
91 OR
92 > cd .../tap-google-sheets
93 > pip install .
94 ```
95 2. Dependent libraries
96 The following dependent libraries were installed.
97 ```bash
98 > pip install target-json
99 > pip install target-stitch
100 > pip install singer-tools
101 > pip install singer-python
102 ```
103 - [singer-tools](https://github.com/singer-io/singer-tools)
104 - [target-stitch](https://github.com/singer-io/target-stitch)
105
106 3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address).
107
108 ```json
109 {
110 "client_id": "YOUR_CLIENT_ID",
111 "client_secret": "YOUR_CLIENT_SECRET",
112 "refresh_token": "YOUR_REFRESH_TOKEN",
113 "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
114 "start_date": "2019-01-01T00:00:00Z",
115 "user_agent": "tap-google-sheets <api_user_email@example.com>"
116 }
117 ```
118
119 Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off.
120 Only the `performance_reports` uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type.
121
122 ```json
123 {
124 "currently_syncing": "file_metadata",
125 "bookmarks": {
126 "file_metadata": "2019-09-27T22:34:39.000000Z"
127 }
128 }
129 ```
130
131 4. Run the Tap in Discovery Mode
132 This creates a catalog.json for selecting objects/fields to integrate:
133 ```bash
134 tap-google-sheets --config config.json --discover > catalog.json
135 ```
136 See the Singer docs on discovery mode
137 [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode).
138
139 5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target)
140
141 For Sync mode:
142 ```bash
143 > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json
144 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
145 ```
146 To load to json files to verify outputs:
147 ```bash
148 > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json
149 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
150 ```
151 To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run:
152 ```bash
153 > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
154 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
155 ```
156
157 6. Test the Tap
158
159 While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices:
160 Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality):
161 ```bash
162 > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
163 ```
164 Pylint test resulted in the following score:
165 ```bash
166 Your code has been rated at 9.78/10
167 ```
168
169 To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working:
170 ```bash
171 > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
172 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
173 ```
174 Check tap resulted in the following:
175 ```bash
176 The output is valid.
177 It contained 3881 messages for 13 streams.
178
179 13 schema messages
180 3841 record messages
181 27 state messages
182
183 Details by stream:
184 +----------------------+---------+---------+
185 | stream | records | schemas |
186 +----------------------+---------+---------+
187 | file_metadata | 1 | 1 |
188 | spreadsheet_metadata | 1 | 1 |
189 | Test-1 | 9 | 1 |
190 | Test 2 | 2 | 1 |
191 | SKU COGS | 218 | 1 |
192 | Item Master | 216 | 1 |
193 | Retail Price | 273 | 1 |
194 | Retail Price NEW | 284 | 1 |
195 | Forecast Scenarios | 2681 | 1 |
196 | Promo Type | 91 | 1 |
197 | Shipping Method | 47 | 1 |
198 | sheet_metadata | 9 | 1 |
199 | sheets_loaded | 9 | 1 |
200 +----------------------+---------+---------+
201 ```
202 ---
203
204 Copyright &copy; 2019 Stitch