author Jeff Huth <jeff.huth@bytecode.io> 2019-11-13 17:03:56 -0800
committer Jeff Huth <jeff.huth@bytecode.io> 2019-11-13 17:03:56 -0800
commit 89643ba6fa98db82efd3246805ef801a8bfb5c81
tree 739027b4e827def2db81631c9d6ed58ec2b97809 /README.md
parent 5f8005471d3affaaf23489df93a58ca64c3da3ca
Initial commit
Discovery mode works. Still working on normal sync.
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 173
1 files changed, 172 insertions, 1 deletions
diff --git a/README.md b/README.md
index db00e68..6fcb0b9 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,172 @@
# tap-google-sheets

This is a [Singer](https://singer.io) tap that produces JSON-formatted data
following the [Singer
spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md).

This tap:

- Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api)
- Extracts the following endpoints:
  - [Spreadsheet File](https://developers.google.com/drive/api/v3/reference/files/get)
  - [Spreadsheet Metadata, Sheets, Headers](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
  - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
- For each Sheet:
  - Outputs the schema for each resource (based on the column header and datatypes of first row of data)
  - Outputs a record for all columns with column headers, and for each row of data until it reaches an empty row

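The per-sheet record output described above can be sketched in Python. This is a minimal illustration, not the tap's actual code: the function name `rows_to_records` and the choice to pad short rows with `None` are assumptions.

```python
def rows_to_records(rows):
    """Turn a sheet's rows (header row first) into a list of dict records."""
    if not rows:
        return []
    headers = rows[0]
    records = []
    for row in rows[1:]:
        # Stop at the first empty row, as described in the list above.
        if not any(cell not in (None, "") for cell in row):
            break
        record = {}
        for index, header in enumerate(headers):
            # Only columns that have a column header are output.
            if header in (None, ""):
                continue
            # Assumption: rows shorter than the header row are padded with None.
            record[header] = row[index] if index < len(row) else None
        records.append(record)
    return records


sample = [
    ["name", "", "qty"],
    ["apple", "ignored", 3],
    ["pear", "ignored"],
    [],
    ["after-gap", "x", 9],
]
print(rows_to_records(sample))
```

In this sample the second column is skipped because it has no header, and the row after the empty row is never read.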
## API Endpoints
[**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
- Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version
- Primary keys: id
- Replication strategy: Full (GET file audit data for spreadsheet_id in config)
- Process/Transformations: Replicate Data if Modified

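The "Replicate Data if Modified" step can be sketched as a timestamp comparison. This is a hedged illustration only: comparing `modifiedTime` against the later of a saved bookmark and `start_date` is an assumption about how the check works, and `parse_utc` and `file_was_modified` are hypothetical names. The parser expects timestamps with zero, three, or six fractional digits.

```python
from datetime import datetime


def parse_utc(value):
    """Parse an RFC 3339 UTC timestamp such as 2019-09-27T22:34:39.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def file_was_modified(modified_time, bookmark, start_date):
    """Return True when the file changed after the last known sync point."""
    threshold = parse_utc(start_date)
    if bookmark:
        # Assumption: a saved bookmark takes effect when it is later than start_date.
        threshold = max(threshold, parse_utc(bookmark))
    return parse_utc(modified_time) > threshold


print(file_was_modified(
    "2019-09-28T08:00:00.000Z",      # modifiedTime from the file endpoint
    "2019-09-27T22:34:39.000000Z",   # bookmark from a previous run
    "2019-01-01T00:00:00Z",          # start_date from config.json
))
```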
[**metadata (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
- This endpoint returns spreadsheet metadata, sheet metadata, and value metadata (data type information)
- Primary keys: spreadsheetId, title, field_name
- Foreign keys: None
- Replication strategy: Full (get and replace file metadata for spreadsheet_id in config)
- Process/Transformations:
  - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
    - sheetId, title, index, gridProperties (rowCount, columnCount)
  - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column uniqueness
    - Header's field_name, position: data.rowData[0].values[i].formattedValue
  - Create/Verify Datatypes (2nd row):
    - Row 2's datatype, format: data.rowData[1].values[i]
      - First check:
        - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue)
          - Valid types: numberValue, stringValue, boolValue
          - Invalid types: formulaValue, errorValue
      - Then check:
        - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType)
          - Valid types: UNSPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
          - If DATE or DATE_TIME, set JSON schema datatype = string and format = date-time
        - [effectiveFormat.numberFormat.pattern](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormat)

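The two-stage datatype check above (first the `effectiveValue` key, then `effectiveFormat.numberFormat.type`) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `infer_column_schema`, the nullable `["null", ...]` type unions, and the string fallback for an empty cell are illustrative choices, not confirmed behavior of the tap.

```python
def infer_column_schema(cell):
    """Map one second-row cell from the metadata response to a JSON schema."""
    effective_value = cell.get("effectiveValue", {})
    format_type = (
        cell.get("effectiveFormat", {}).get("numberFormat", {}).get("type")
    )

    # First check: which key is present in effectiveValue.
    if "formulaValue" in effective_value or "errorValue" in effective_value:
        raise ValueError("formulaValue and errorValue are not valid column types")
    if "boolValue" in effective_value:
        return {"type": ["null", "boolean"]}
    if "stringValue" in effective_value:
        return {"type": ["null", "string"]}
    if "numberValue" in effective_value:
        # Then check: DATE and DATE_TIME become date-time strings.
        if format_type in ("DATE", "DATE_TIME"):
            return {"type": ["null", "string"], "format": "date-time"}
        return {"type": ["null", "number"]}
    # Assumption: an empty cell falls back to a nullable string.
    return {"type": ["null", "string"]}


date_cell = {
    "effectiveValue": {"numberValue": 43831.5},
    "effectiveFormat": {"numberFormat": {"type": "DATE_TIME"}},
}
print(infer_column_schema(date_cell))
```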
[**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS
- This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), with dates and datetimes returned as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption)
- Primary keys: row
- Replication strategy: Full (GET file audit data for spreadsheet_id in config)
- Process/Transformations:
  - Loop through sheets (compared to catalog selection)
    - Send metadata for sheet
    - Loop through ranges of rows until reaching empty row or area max row (from sheet metadata)
      - Transform values, if necessary (dates, date-times, booleans, integers, numbers)
      - Process/send records

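Two parts of the values loop lend themselves to a short Python sketch: batching row ranges up to the sheet's maximum row, and converting `SERIAL_NUMBER` dates. In the Sheets API a serial number counts days since December 30, 1899, with the fraction as the time of day. The helper names, the batch size of 200, and the starting row of 2 are hypothetical. The converted value carries no time zone, since serial numbers are relative to the spreadsheet's own time zone.

```python
from datetime import datetime, timedelta

# Day zero for Google Sheets serial numbers.
SHEETS_EPOCH = datetime(1899, 12, 30)


def serial_to_iso(serial):
    """Convert a Sheets serial number to a naive ISO 8601 string."""
    return (SHEETS_EPOCH + timedelta(days=serial)).isoformat()


def row_ranges(max_row, batch_size=200, first_row=2):
    """Yield row ranges such as '2:201' until max_row is covered."""
    start = first_row
    while start <= max_row:
        end = min(start + batch_size - 1, max_row)
        yield f"{start}:{end}"
        start = end + 1


print(serial_to_iso(43831.5))
print(list(row_ranges(450)))
```

A real loop would also stop early once a batch contains an empty row, as the list above describes.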
## Authentication
The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions showing how to configure the Google Cloud API credentials to enable the Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate OAuth client credentials (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.
- Enable Google Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly
- Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly
- Tap config.json parameters:
  - client_id: identifies your application
  - client_secret: authenticates your application
  - refresh_token: generates an access token to authorize your session
  - spreadsheet_id: unique identifier for each spreadsheet in Google Drive
  - start_date: absolute minimum start date to check file modified
  - user_agent: tap-name and email address; identifies your application in the Remote API server logs

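To illustrate how the config.json parameters fit together, the sketch below checks the config and builds (but does not send) a refresh-token request against Google's standard OAuth 2.0 token endpoint. Treating all six parameters as required, the helper names, and the use of this exact endpoint by the tap are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Google's standard OAuth 2.0 token endpoint (assumed, not taken from the tap).
TOKEN_URL = "https://oauth2.googleapis.com/token"
# Assumption: every parameter listed above is required.
REQUIRED_KEYS = (
    "client_id", "client_secret", "refresh_token",
    "spreadsheet_id", "start_date", "user_agent",
)


def missing_config_keys(config):
    """Return the required keys that are absent or empty in the config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def build_refresh_request(config):
    """Build the POST request that exchanges a refresh_token for an access token."""
    missing = missing_config_keys(config)
    if missing:
        raise ValueError("config.json is missing: " + ", ".join(missing))
    body = urlencode({
        "grant_type": "refresh_token",
        "client_id": config["client_id"],
        "client_secret": config["client_secret"],
        "refresh_token": config["refresh_token"],
    }).encode("utf-8")
    headers = {"User-Agent": config["user_agent"]}
    return Request(TOKEN_URL, data=body, headers=headers, method="POST")


example_config = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "refresh_token": "YOUR_REFRESH_TOKEN",
    "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
    "start_date": "2019-01-01T00:00:00Z",
    "user_agent": "tap-google-sheets <api_user_email@example.com>",
}
request = build_refresh_request(example_config)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the exchange; the JSON response from the token endpoint includes an `access_token` field.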
## Quick Start

1. Install

    Clone this repository, and then install using setup.py. We recommend using a virtualenv:

    ```bash
    > virtualenv -p python3 venv
    > source venv/bin/activate
    > python setup.py install
    # OR
    > cd .../tap-google-sheets
    > pip install .
    ```
2. Dependent libraries
    The following dependent libraries were installed.
    ```bash
    > pip install target-json
    > pip install target-stitch
    > pip install singer-tools
    > pip install singer-python
    ```
    - [singer-tools](https://github.com/singer-io/singer-tools)
    - [target-stitch](https://github.com/singer-io/target-stitch)

3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, spreadsheet_id, start_date (UTC format), and user_agent (tap name with the API user email address).

    ```json
    {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "refresh_token": "YOUR_REFRESH_TOKEN",
        "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
        "start_date": "2019-01-01T00:00:00Z",
        "user_agent": "tap-google-sheets <api_user_email@example.com>"
    }
    ```

    Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off.
    In the example below, each bookmark is a date-time value keyed by stream name (`file_metadata`, `sheets_loaded`).

    ```json
    {
        "currently_syncing": "file_metadata",
        "bookmarks": {
            "file_metadata": "2019-09-27T22:34:39.000000Z",
            "sheets_loaded": "2019-09-27T21:55:42.000000Z"
        }
    }
    ```

4. Run the Tap in Discovery Mode
    This creates a catalog.json for selecting objects/fields to integrate:
    ```bash
    tap-google-sheets --config config.json --discover > catalog.json
    ```
    See the Singer docs on discovery mode
    [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode).

5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target)

    For Sync mode:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```
    To load to json files to verify outputs:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```
    To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```

6. Test the Tap

    While developing the Google Sheets tap, the following utilities were run in accordance with Singer.io best practices:
    Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality):
    ```bash
    > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
    ```
    Pylint test resulted in the following score:
    ```bash
    TBD
    ```

    To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```
    Check tap resulted in the following:
    ```bash
    TBD
    ```
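The state handling in steps 3 and 5 (keep only the last output line with `tail -1`, then read bookmarks from it on the next run) can be sketched in Python. The helper names `last_state` and `get_bookmark` are hypothetical, and the sample lines assume the final line is a JSON state object shaped like the `state.json` example in step 3.

```python
import json


def last_state(output_lines):
    """Mirror `tail -1`: parse the last non-blank line as JSON."""
    last = None
    for line in output_lines:
        if line.strip():
            last = line
    return json.loads(last) if last is not None else {}


def get_bookmark(state, stream, default=None):
    """Look up a stream's date-time bookmark in a state dict."""
    return state.get("bookmarks", {}).get(stream, default)


lines = [
    '{"bookmarks": {"file_metadata": "2019-09-27T22:34:39.000000Z"}}',
    '{"bookmarks": {"file_metadata": "2019-09-27T22:34:39.000000Z", '
    '"sheets_loaded": "2019-09-27T21:55:42.000000Z"}}',
    "",
]
state = last_state(lines)
print(get_bookmark(state, "sheets_loaded"))
```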
---

Copyright &copy; 2019 Stitch